Abstract

Many current paradigms for acoustic event detection (AED) are not 
adapted to the organic variability of natural sounds, and/or they assume 
a limit on the number of simultaneous sources: often only one source, or 
one source of each type, may be active. These aspects are highly undesirable
for applications such as bird population monitoring. We introduce
a simple method modelling the onsets, durations and offsets of acoustic 
events to avoid intrinsic limits on polyphony or on inter-event temporal 
patterns. We evaluate the method in a case study with over 3000 zebra 
finch calls. In comparison against a HMM-based method we find it more 
accurate at recovering acoustic events, and more robust for estimating 
calling rates. 


1 Introduction 

Acoustic event detection (AED) is useful for various purposes, such as security
monitoring, wildlife monitoring, and music transcription [?]. Many
approaches to AED assume that the acoustic scene is monophonic (having no
simultaneous or overlapping events), which is unrealistic but useful for some
applications. Approaches which allow for polyphonic scenes are more flexible,
but often assume that each stream is different in kind, allowing one monophonic
stream for each class of event considered [?]. They may also assume a fixed
number of simultaneous streams, for example when source separation is applied
as a first step and then each separated channel is treated as a monophonic scene.

In this paper we explore approaches for AED in cases with an unknown 
number of similar sources. As an example, consider a sound recording in which 
a flock of birds can be heard calling, all of the same species. This is representative
of practical scenarios in which AED might assist ecologists or conservation
organisations wishing to estimate the total number of individuals detected in a
recording, or alternatively the total number of calls in a recording. Both such
“point counts” are used (in most cases manually detected) for monitoring trends 
in breeding populations [4]. 

Note the specific information need: overall event counts. This is distinct
from the “transcription” task, in which the objective is to recover a list of the true
events. An exact transcription would itself give us an exact value for the event 
count; but an imperfect transcription may be an inefficient or biased route to 
count estimation. In the abstract, event counting tasks are regression problems, 
and may not require an estimated event transcript at any point. Also important 
is that we wish to avoid placing limits or inappropriate biases on the number 
of events that can be simultaneously active. Methods that assume monophonic 
sequences per event class are likely to be inappropriate, and other methods may 
bias estimation due to assumptions implicit in their models. With this in mind, 
we briefly consider some event detection paradigms in previous work. 

To decompose an audio scene, one strand of research uses non-negative matrix
factorisation (NMF), in particular convolutive NMF which allows events
to have spectro-temporal structure [5]. However, these models are inflexible 
about the temporal evolution within an event, depending on good matching of 
spectro-temporal templates. This is particularly problematic for sounds with 
much inherent within-class variability such as animal calls. 

Hidden Markov models (HMMs) have been used in various systems for acoustic
event detection (e.g. [?]). These can allow for variation in the temporal
evolution of events. However, a typical HMM corresponds to a monophonic
model of events; extensions such as the factorial HMM extend this to a specific
fixed number of parallel sources, and thus retain strong limits on the level of
polyphony. In one example of polyphonic adaptation of HMM tracking, the authors
of [?] train and apply a standard HMM for event detection, where in their case each
state corresponds to a class of event. To achieve polyphonic detection they perform
multiple Viterbi decoding passes: after each Viterbi pass, the states used
are taken out of consideration for the subsequent passes. In this way a transcription
is obtained which allows multiple event classes to occur in parallel.
However, it does not allow multiple simultaneous events of the same class, and
retains the fixed limit on polyphony.

In Section 3 we will describe an alternative way to adapt a HMM to a multiple-detection
scenario. However, that is primarily as a point of comparison against
the main model we wish to explore here, which uses an onset-duration-offset 
model of acoustic events to allow for unbounded polyphony. We describe this 
method in the next section. Then we will describe our alternative HMM method, 
before evaluating both methods using a dataset of bird calls. 

2 Onset-duration-offset model 

Physiological studies indicate that biological auditory processing involves early-stage
“edge detectors” having separate auditory detection units for onsets and
for offsets, both in humans [10] and in songbirds [?]. The information from
these detectors is then combined in later processing for cognition of “auditory
objects” or events. Although there is no requirement for our computational systems
to mimic the organisation found in nature, this suggests that a processing
strategy starting with onset and offset detectors and combining their outputs
may be fruitful. We can combine onsets and offsets with other information to
yield posterior beliefs about the events observed (Figure 1). If these components
are based on the characteristics of individual events, and not the temporal
relationships between events, we should be able to design a system that imposes
few constraints or biases on the observable event patterns.

Figure 1: Schematic diagram of onset-duration-offset event detection model.

The scheme just presented assumes that the onset and offset characteristics 
are the reliable, relatively invariant characteristics of the events of interest. 
However, it makes no strong assumptions on event durations, nor even the 
signal content in the middle of the event, thus allowing for organic variability. 
It also makes no strong assumptions on the temporal occurrence patterns, and 
in particular the level of polyphony is unbounded: at any particular time, if the 
system overall has detected k more onsets than offsets, then the current number 
of parallel active events is k. 

Our approach is probabilistic: we will use onset/offset detectors that yield
detection probabilities at each time point, and our prior beliefs about event duration
will be expressed as a distribution over durations. To combine these probabilities
together, we characterise acoustic events in a two-dimensional space
indexed by onset time t and duration τ. We characterise the conditional probability
of an event at some point in that space as

$$p_{\mathrm{evt}}(t, \tau \mid y) \propto p_{\mathrm{on}}(t \mid y) \, p_{\mathrm{off}}(t + \tau \mid y) \, p_{\mathrm{dur}}(\tau)$$

where y is the observation (the audio signal). The conditional probabilities p_on
and p_off come from the detectors, and p_dur is our duration prior. When dealing
with discretised time (as we do here), this Bernoulli model imposes a
constraint: no two events can have exactly the same onset time and duration.
This constraint is very mild, since events can co-occur in our scheme as long as
they have slight mutual differences in onset time or duration. (We will impose
a slightly stronger constraint to recover an event transcript, described shortly.)

Figure 2: Schematic diagram of how probabilities are combined in the onset-duration-offset
event detection model. Each source of probabilistic information
is a marginal with respect to a different direction in the [onset × duration] space.

Each of our probabilistic sources of information (onsets, offsets, durations)
gives us information that acts as a one-dimensional marginal, when considered
in our two-dimensional [onset × duration] space (Figure 2). Note particularly
that offset detection probabilities are translated into [onset × duration] space
with an off-axis influence, since offset time is equivalent to onset time plus duration.
In this space, we assume that the three types of probability information
are conditionally independent and multiply them together to produce a posterior
“intensity” over possible events (Figure 2). This could be thresholded
to give a polyphonic event sequence, or marginalised to give posterior beliefs
about onsets and offsets. Such posterior beliefs will be related to the raw onset/offset
detections but refined using the other information sources. Note that
the posterior is not a probability distribution: it does not sum up to one over
our 2D space. It represents a set of binomial probabilities; the sum over the 2D
posterior gives the expected number of events.
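
The combination step just described can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the function names are hypothetical, time and duration are measured in frames, and the proportionality constant in the expression for p_evt is assumed to be one.

```python
def odo_posterior(p_on, p_off, p_dur):
    """Combine onset, offset and duration probabilities into a 2D posterior.

    p_on[t]  : onset detection probability at frame t
    p_off[t] : offset detection probability at frame t
    p_dur[d] : prior probability of a duration of d frames (index 0 unused)

    Returns post[t][d], the unnormalised "intensity" of an event with onset
    frame t and duration d. Cells whose offset would fall beyond the end of
    the recording are left at zero (an assumption of this sketch).
    """
    n_frames = len(p_on)
    post = [[0.0] * len(p_dur) for _ in range(n_frames)]
    for t in range(n_frames):
        for d in range(1, len(p_dur)):
            if t + d < n_frames:
                # Offsets act "off-axis": offset time = onset time + duration.
                post[t][d] = p_on[t] * p_off[t + d] * p_dur[d]
    return post


def expected_count(post):
    """Sum of the 2D posterior, read as the expected number of events."""
    return sum(sum(row) for row in post)
```

For example, a single confident onset at frame 1 and offset at frame 3, with a prior favouring two-frame durations, yields one dominant cell at (t = 1, τ = 2), and the sum of the posterior is that cell's value.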

In the implementation we use here, for onset and offset detectors we use 
random forest regression (cf. [?]). Our detectors take spectrogram patches as
input, the spectrograms having been treated with background noise reduction 
by median-thresholding, and then first differencing in time. The trained random 
forest outputs detection probabilities for each patch. Since the regression makes 
an independence assumption for adjacent (overlapping) patches, the outputs are 
liable to be correlated in time; to reduce this effect, after training the random 
forest we then train an ordinary least squares regression from a sliding window 
of 11 outputs from the detector onto the ground truth, to recover “sharper” 
detections. To implement our prior on event durations, we will train a Gaussian 
mixture model (GMM) on the durations observed in training data. 
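
The "sharpening" step can be sketched as follows, using NumPy rather than any code from our experiments. The function names are hypothetical, and the zero-padding at the edges, the intercept term and the final clipping to [0, 1] are all assumptions of this sketch; the text above specifies only an ordinary least squares fit from a window of 11 detector outputs onto the ground truth.

```python
import numpy as np


def sliding_windows(x, width=11):
    """Stack centred windows of `width` samples, zero-padded at the edges."""
    half = width // 2
    padded = np.pad(np.asarray(x, dtype=float), half)
    return np.stack([padded[i:i + width] for i in range(len(x))])


def design_matrix(raw, width=11):
    """Windows of raw detector output plus a constant (intercept) column."""
    return np.hstack([sliding_windows(raw, width), np.ones((len(raw), 1))])


def fit_sharpener(raw, truth, width=11):
    """Least-squares weights mapping windows of raw output to ground truth."""
    coef, *_ = np.linalg.lstsq(design_matrix(raw, width),
                               np.asarray(truth, dtype=float), rcond=None)
    return coef


def apply_sharpener(raw, coef, width=11):
    """Apply the fitted weights and clip the result to valid probabilities."""
    return np.clip(design_matrix(raw, width) @ coef, 0.0, 1.0)
```

On the training data, the fitted output can never have a larger squared error than the raw detector output, because leaving the raw output unchanged is one of the linear maps the regression may choose.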

We note some resemblances between our approach and that of [?]. Those
authors also use random forest regression as a recognition component that contributes
towards an eventual event segmentation. However, their method is
fundamentally different in that its elements for recognition are not the onsets/offsets,
but the frames “within” an event, which have been augmented
with pointers to their associated onset/offset. For this and other reasons their
approach is limited to monophonic event detection.

In the following we will refer to our onset-duration-offset model as ODO for 
short. 

2.1 Recovering an event transcript vs. event counts 

To recover a definite set of events, our 2D posterior can be thresholded using
a threshold determined during the training phase. In practice, however, we
observe that this tends to yield a large number of duplicated events, since any
particular ground-truth event of duration τ will often be detected with a relatively
strong probability for duration τ + 1, τ − 1, etc., each of which is a
separate position in our 2D posterior. This effect can be reduced by imposing
further assumptions: perhaps about the maximum polyphony, or the temporal
pattern. In the present work we wish to avoid imposing assumptions that might
strongly bias event counts and the like. We choose to impose an assumption
which is partly implicit in the detectors themselves: the assumption that only
one onset, and one offset, may happen in any time frame. This assumption
may bias detection in very dense audio recordings, but for many densities encountered
in practice this assumption holds almost always. Hence to recover an
event transcript, we keep only the events whose posterior probability is stronger
than all other events with matching onset time or offset time. To these events
we then apply a threshold to discard low-probability events.

Taking Figure 2 as an example, in the posterior we see that the density due
to the second detected onset overlaps in 2D space with two possible offsets. In 
this case we would keep no more than one of these events, namely that which 
yields the strongest probability. 

A straightforward way to count events is to count the number of items included
in an event transcript. However, as just described, producing an event
transcript requires making “hard” thresholding decisions, discarding some information
from the posterior. We can avoid the transcription step and simply
use the sum of the posterior, over the time range of interest, as the expected
event count in that time range. We will use this in our evaluation.
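
The transcript recovery rule of this section can be sketched as below. This is an illustration with hypothetical names, operating on a posterior stored as post[t][d] (onset frame t, duration d in frames, index 0 unused). It applies the rule in a single pass and keeps both events in the case of an exact tie; both choices are assumptions of the sketch.

```python
def recover_transcript(post, threshold):
    """Return (onset, duration, probability) events from a 2D posterior.

    An event is kept only if no other event sharing its onset frame or its
    offset frame has a stronger posterior, and if it reaches `threshold`.
    """
    best_by_onset = {}
    best_by_offset = {}
    for t, row in enumerate(post):
        for d, p in enumerate(row):
            if d == 0:
                continue
            best_by_onset[t] = max(best_by_onset.get(t, 0.0), p)
            best_by_offset[t + d] = max(best_by_offset.get(t + d, 0.0), p)

    events = []
    for t, row in enumerate(post):
        for d, p in enumerate(row):
            if d == 0 or p <= 0.0 or p < threshold:
                continue
            # One onset and one offset per frame: keep only the strongest.
            if p >= best_by_onset[t] and p >= best_by_offset[t + d]:
                events.append((t, d, p))
    return events
```

Because the rule is applied in one pass, a candidate can be suppressed by a stronger candidate that is itself suppressed elsewhere; an iterative variant would behave differently in such cases.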


3 Hidden Markov model for event count 

As an alternative to the method presented above, we also describe a hidden 
Markov model (HMM) method for event detection. As described in Section 1,
HMM approaches to event detection impose limits on the possible polyphony, 
and also may bias the durations and timing patterns of detected events. With 
those caveats acknowledged, we wish to use the HMM as a point of comparison 
since it is in widespread use. So in order to detect events of a single type, 
but with potential polyphony, we apply a HMM but where the hidden states 
are not simply ‘on’ and ‘off’, but the count of currently-active events, i.e. the 
count of events that have started but not yet ended. The state space is thus 
{0, 1, 2, ..., K} where K is the maximum number of simultaneous events observed
in the training data. 

For modelling the observations, we train a separate Gaussian mixture model 
(GMM) with ten components for each cardinality. Again we use spectral patches 
to train this model, but without differencing them in time since in this case we 
aim to model states rather than transitions. 

To recover an event transcription from this HMM, we perform Viterbi decoding.
From the decoded sequence of cardinalities, we deduce onset and offset
times, and we associate onsets and offsets with each other in order of occurrence. 
This transcript is then also used for event counting. 
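
The step from a decoded cardinality sequence to an event list can be sketched as follows. The function name is hypothetical, and the sketch assumes that the recording starts with no active events and that any events still open at the end are closed at the final frame boundary.

```python
from collections import deque


def cardinalities_to_events(cards):
    """Convert per-frame event counts into (onset_frame, offset_frame) pairs.

    Rises in cardinality are read as onsets and falls as offsets; onsets and
    offsets are paired in order of occurrence (first in, first out).
    """
    open_onsets = deque()
    events = []
    prev = 0  # assumed: no events active before the first frame
    for frame, k in enumerate(cards):
        for _ in range(max(k - prev, 0)):
            open_onsets.append(frame)
        for _ in range(max(prev - k, 0)):
            events.append((open_onsets.popleft(), frame))
        prev = k
    # Close anything still active at the end of the sequence.
    while open_onsets:
        events.append((open_onsets.popleft(), len(cards)))
    return events
```

Note that an onset and an offset falling in the same frame leave the cardinality unchanged, so this decoding cannot see them.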

3.1 Combining the two models 

The ODO and HMM models we have described offer two very different approaches
to event detection. We note that it is possible to combine the two, as
follows. We can expand the HMM state space to include not only the current
event cardinality, but also two binary indicators of whether the current frame
includes an onset and/or an offset. This expands the set of HMM states by a
factor of four. Not all state transitions are possible: e.g. a change in cardinality
from 3 to 4 can only occur when the onset state is 1. We do not impose such
limits manually but allow the system to learn them from the transitions seen in
training data.

Even with this expanded state space, HMM-based models inherit the limitations
already mentioned: cardinalities higher than seen in the training data will
not be correctly detected, and the HMM may bias timing patterns. Nonetheless,
in the following evaluation we compare the empirical characteristics of the
methods we have described.
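
As an illustration of the expanded state space, the sketch below enumerates states as (cardinality, onset flag, offset flag) triples and counts the transitions seen in a training sequence. The names are hypothetical, and how the counts are normalised or smoothed into transition probabilities is left open here.

```python
from collections import Counter
from itertools import product


def build_states(max_card):
    """All (cardinality, onset_flag, offset_flag) states for 0..max_card."""
    return list(product(range(max_card + 1), (0, 1), (0, 1)))


def transition_counts(state_seq):
    """Count each (previous state, next state) pair in a training sequence."""
    return Counter(zip(state_seq[:-1], state_seq[1:]))
```

With K = 4 this gives 5 × 4 = 20 states. Transitions that never occur in training, such as a rise in cardinality without the onset flag set, simply receive a count of zero.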


4 Evaluation 

We recorded a set of four female zebra finches (Taeniopygia guttata) in an 
indoor aviary. The birds exchanged contact calls at a rate of approx 40 calls 
per minute. Birds were recorded for extended periods, and their calls were 
transcribed as event sequences. Transcription was performed separately by two 
human annotators, whose annotations were combined automatically, and any 
discrepancies resolved by the first author. 

This recording setup was designed for use in various studies; in the present 
paper we use it as a case study for event detection. We took two 30-minute 
recording sessions, recorded with the same birds but on separate days, and used 
these two sessions for two-fold crossvalidation. The 30-minute sessions contained 
1663 and 1770 annotated calls in total. Here we use single-channel omni mic 
recordings of the sessions. 

The true polyphony in the original recordings ranged from zero to four. In 
order to investigate heavier densities, as well as to investigate the effect of density
mismatches between training and test data, for each 30-minute session we
also created an artificial 10-minute mixture with the three 10-minute segments
superimposed. All experiments thus used the same set of calls, but in some cases 
the training or test data was “folded” down to a denser 10-minute recording by 
superimposition. 
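
The effect of folding on the annotations can be sketched as follows. This is an illustration only: the names are hypothetical, events are (onset, duration) pairs in seconds, and calls that straddle a segment boundary are not treated specially.

```python
def fold_events(events, segment_s=600.0):
    """Map event onsets from a long session onto one `segment_s` segment."""
    return sorted((onset % segment_s, dur) for onset, dur in events)


def max_polyphony(events):
    """Largest number of simultaneously active events."""
    edges = []
    for onset, dur in events:
        edges.append((onset, 1))
        edges.append((onset + dur, -1))
    # At equal times, process offsets (-1) before onsets (+1).
    edges.sort()
    active = peak = 0
    for _, step in edges:
        active += step
        peak = max(peak, active)
    return peak
```

Calls that were well separated in the 30-minute session can coincide after folding, which is how the denser mixtures exceed the polyphony of the original recordings.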

As in other event detection evaluations [?], for evaluating event transcription
we use the F-measure metric, and we consider an event to be correctly
recovered if the onset matches within a fixed tolerance and the duration matches
within 50% of the true duration. The typical event duration in this data was
approx 100 ms, so we chose ±25 ms as our tolerance.
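
The matching criterion can be sketched as below. The function name is hypothetical, and the greedy one-to-one matching (each true event may be claimed by at most one estimated event, in the order estimates are given) is an assumption of this sketch; other matching strategies could give slightly different scores.

```python
def f_measure(truth, estimate, onset_tol=0.025, dur_frac=0.5):
    """F-measure between two lists of (onset_s, duration_s) events.

    An estimated event is a hit if its onset lies within `onset_tol` seconds
    of an unmatched true event and its duration lies within `dur_frac` of
    that true event's duration.
    """
    unmatched = list(truth)
    hits = 0
    for onset, dur in estimate:
        for ref in unmatched:
            ref_onset, ref_dur = ref
            if (abs(onset - ref_onset) <= onset_tol
                    and abs(dur - ref_dur) <= dur_frac * ref_dur):
                unmatched.remove(ref)
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(estimate)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For a 100 ms call, the duration criterion accepts estimates between 50 ms and 150 ms.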

Separately, for evaluating event counts we divide the data into ten-second 
windows and measure the RMS error between the true and estimated number 
of events for each window. Note that both systems do exhibit some miscalibration,
in that their estimated counts even on the training data could exhibit a
multiplicative deviation from the truth. For the ODO system we believe this
is largely due to the independence assumption already mentioned in the underlying
detectors, and thus more sophisticated edge detectors might remedy this.
To account for this most basic aspect of miscalibration, during training of all 
systems we used the training data to choose a multiplicative calibration factor 
to apply to all event counts. Calibration did not make use of test data. RMS 
error statistics are reported from the calibrated outputs. 
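
The count evaluation can be sketched as follows. The names are hypothetical, and the least-squares choice of calibration factor is an assumption: the text above states only that a multiplicative factor was chosen using the training data.

```python
import math


def window_counts(onsets, total_s, window_s=10.0):
    """Number of event onsets falling in each consecutive window."""
    n = int(math.ceil(total_s / window_s))
    counts = [0] * n
    for onset in onsets:
        counts[min(int(onset // window_s), n - 1)] += 1
    return counts


def calibration_factor(true_counts, est_counts):
    """Factor c minimising the squared error between c * est and truth."""
    denom = sum(e * e for e in est_counts)
    if denom == 0:
        return 1.0
    return sum(t * e for t, e in zip(true_counts, est_counts)) / denom


def rms_error(true_counts, est_counts):
    """Root-mean-square error between two equal-length count sequences."""
    squared = [(t - e) ** 2 for t, e in zip(true_counts, est_counts)]
    return math.sqrt(sum(squared) / len(squared))
```

In use, the factor would be fitted on training windows only and then applied to the estimated counts for the test windows before computing the RMS error.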

We tested the following five configurations of event detector:

• The full ODO system of Section 2.

• The ODO system with a flat duration prior rather than the GMM. This
constrained events to lie within a reasonable duration but did not learn
a distribution on the duration, and allowed us to probe the extent of the
benefit provided by the GMM duration prior.

• The HMM system of Section 3.

• The combined ODO+HMM system of Section 3.1.

• The raw output from the onset-detector component. This can be evaluated
for event counts only, not transcription, but indicates the extent to which
the detector component contributes to ODO performance.

Specific implementation details were as follows. Audio was recorded at 96 
kHz, and spectrograms calculated with frame length 2048, 50% overlap and 
Hann windowing. Spectral information outside the range 0.5-20 kHz was discarded.
Noise reduction was implemented using spectral median-subtraction,
with the median calculated in a sliding ten-second window for each spectral
band. Spectral patches for the onset/offset detectors were taken from the
time-differenced spectrogram, of size five frames before and five frames after the
frame under consideration. The detectors were implemented using random forest
regression from the sklearn module, with 20 trees. Spectral patches for the
HMM-GMM modelling were taken from the non-time-differenced spectrogram, 
of size five frames after the frame under consideration. 
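
The time and frequency resolution implied by these settings can be worked out directly. The short calculation below uses only the figures stated above; the variable names are hypothetical, and treating the band edges as inclusive bin limits is an assumption.

```python
import math

SAMPLE_RATE_HZ = 96_000
FRAME_LEN = 2048
HOP = FRAME_LEN // 2                      # 50% overlap

hop_ms = 1000.0 * HOP / SAMPLE_RATE_HZ    # time step between frames
bin_hz = SAMPLE_RATE_HZ / FRAME_LEN       # width of one frequency bin

# Frequency bins retained after discarding content outside 0.5-20 kHz.
lo_bin = math.ceil(500 / bin_hz)
hi_bin = math.floor(20_000 / bin_hz)
n_bins = hi_bin - lo_bin + 1

# An onset/offset patch covers five frames either side of the current one.
patch_frames = 5 + 1 + 5
patch_samples = (patch_frames - 1) * HOP + FRAME_LEN
patch_ms = 1000.0 * patch_samples / SAMPLE_RATE_HZ
```

This gives a frame step of about 10.7 ms and a bin width of 46.875 Hz, so each detector patch spans 128 ms of audio, and the ±25 ms onset tolerance corresponds to a little over two frames.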

Results for event transcription (Figure 3) show a number of tendencies.
Firstly, the full ODO model consistently outperforms the ODO model with flat
duration prior, indicating that the learned prior on event durations adds useful
information. Secondly, the HMM system generally performs much worse than
the ODO system. This poor performance might be attributed to various points
of difference between the two systems, and so it is interesting to observe that
the combined ODO+HMM system improves on HMM but does not approach
ODO’s strong performance, despite making use of ODO’s onset/offset output
as part of its input observations.

The experiments with mismatched density in training and test give a general 
performance degradation for all systems, indicating that there is still some way 
to go to perform detection robust to very wide variation in event density. The 
HMM-based systems perform relatively well in the experiment with increased 
training density, an exception to the general pattern. Conversely, the poor performance
of the HMM-based systems in the experiment with increased testing
density matches expectations since the training did not encompass all the event 
cardinalities found in the testing data. 

In the best conditions, our ODO system achieved an F-measure of around 37%.
Although quite distant from the ideal of 100%, it is on a similar scale to the
results reported for state-of-the-art methods for related tasks [?].

Figure 3: Event transcription results (F-measure), averaged over the two cross-validation
folds. Error bars cover the range of results obtained within individual
folds.

Results for event counting (Figure 4) show a slightly different picture. Again,
the mismatch in training and testing conditions has a general negative impact on
performance. In the main experiment, most of the systems perform at very similar
quality levels. The exception is the raw onset detector output, included for
comparison, which performs notably worse than the full systems, illustrating
that the ODO method performs much better than its underlying detector. However,
although the HMM and ODO+HMM systems achieve similar performance
to ODO in the main experiment, this is not the case in the experiments with
mismatched training and test densities, for which their performance degrades
further.

Taken together, these evaluations indicate that the ODO method is more
accurate and more robust than the HMM method for detecting or counting
events in polyphonic bird recordings such as those we have studied. The method
can be used for any data with events well-characterised by ‘landmarks’ such
as onsets and offsets, including animal and human sounds. However, we note
that all the systems evaluated here showed a considerable drop in performance
when evaluated with mismatched event densities. Improving these polyphonic
detection paradigms to be robust to these wide ranges of event density remains
as future work. Good detector components must be a key to strong performance:
for the present work we used simple detection methods using spectral patches
as data; improvements such as feature learning [?] could improve detection
performance. It also remains to evaluate systems on a wider range of event-annotated
audio recordings.

Figure 4: Count estimation results (RMS error of counts in ten-second time
windows). The y-axis is inverted so that upwards means better performance, to
match Figure 3.